Abstract

Newspapers are rich records of U.S. history. Due to the deterioration of older newspapers,
the National Endowment for the Humanities is archiving 19th century newspapers on
microfilm. Although microfilm is a good preservation method, it provides limited access
to researchers and the general public. We are building a system to provide universal
access to digital images and full-text content of historical newspapers. The system has
three main components: (a) an Optical Character Recognition (OCR) module that
converts digitized images into searchable text and identifies regions, (b) an Information
Retrieval module that applies linguistic information to aid in segmentation, indexing,
and retrieval of the noisy OCR’d text, and (c) a User Interface module that allows
historians and educators to query and view retrieved documents. Thus far, we have
developed two OCR techniques targeted to processing historical newspapers, and we have
built a user interface to search the OCR output and superimpose matches on a page image
from the newspaper.


This  research  was  funded  in  part  by  the  Department  of  Defense  and  the  Army  Research  Laboratory 
under  Contract  MDA  9049-6C-1250. 


1  Introduction 


Newspapers are rich records of U.S. history. Due to the deterioration of older newspapers,
the National Endowment for the Humanities is archiving 19th century newspapers
on microfilm. Although microfilm is a good preservation method, it provides limited
access for researchers and the general public. We propose to build a system to convert
microfilmed historic newspapers into both digital image and full-text electronic archives,
thereby providing universal access to their content over the web.

The extraction of text from the page images presents a wide range of research issues.
To meet the research challenges of this project we have assembled an interdisciplinary
team with expertise in document image analysis, user-interface design, information
retrieval, system integration, cross-language retrieval, library metadata standards, history,
and journalism. Some of the problems, for example the extraction of entire newspaper
articles from zones of text, have not been considered in any of the related research
communities.

Several  other  projects  have  developed  collections  of  digitized  newspapers,  journals, 
and  personal  letters.  Perhaps  the  best  known  of  these  is  JSTOR  [34]  which  has  digitized 
page  images  of  scholarly  journals.  While  JSTOR  does  OCR  of  its  content,  the  original 
material  is  fairly  straightforward  compared  to  the  historic  newspapers  we  plan  to  index. 
The  “Valley  of  the  Shadow”  project  [35],  which  examined  primary  source  material  from 
a  pair  of  communities  involved  in  the  Civil  War,  did  not  include  the  full  text  of  many 
news  stories  but  had  only  abstracts  of  those  stories.  Our  tools  could  greatly  enrich  that 
corpus  by  providing  direct  OCR  of  the  newspapers. 

1.1  Brief  History  of  Newspapers  as  Relevant  to  OCR 

Publick Occurrences, Both Forreign and Domestick, the first newspaper to be published
in North America, was printed in Boston in 1690. This newspaper was immediately
banned by the Governor of Massachusetts, since non-official publications were not allowed
in colonial America. After this initial setback, newspapers flourished in North America.
In 1704, the Boston Newsletter became the first official newspaper to be published and
replaced the official pamphlets, newsletters and proclamations. By the Revolution, there
were 37 different newspaper titles. Many newspapers like Pennsylvania Packet and Daily
Advertiser then started publishing daily and their titles reflected their new revenue source.
Most colonial editorials tried to chronicle the historical events that led to the creation of
the new nation but in the process became quite partisan. The New York Herald, which
was established in 1835, was the first newspaper to be advertised as a politically
independent newspaper.

Technological advances in printing helped increase the production and improve
the quality of newspapers. Prior to 1814, hand-operated wooden presses were used for
printing newspapers, books, magazines, and pamphlets. The invention of the steam-driven
“double-press” increased the rate of production to 5,000 copies per hour. This
rate increased in 1865 with the invention of the rotary press. The quality of print was
affected by the inking methods (automatic or manual), printing plates, type of paper,
etc. The typesetting up to that point was done character by character. That is, each
individual character was typeset manually on a matrix and then the entire typeset page
was printed. Setting individual characters led to non-uniformity in the orientation and
location of characters.

A major change in newspaper print quality occurred in the late 19th century when the
Linotype printing machine was invented by Ottmar Mergenthaler. This machine allowed
the typesetter to typeset and cast an entire line of type using a keyboard. The Linotype
improved the quality of print and also increased the rate of typesetting of newspapers.
The number of American newspapers increased dramatically between 1880 and 1900, from
850 titles to 2,000. In 1886 the New York Tribune was one of the first newspapers to use
the Linotype for printing. The drawback of Linotype machines was that they did not allow
kerning of letters (unless compound matrices were used) and so the italic letter ‘f’ had
a stunted head and tail. Monotype printing machines were invented at the same time
by Tolbert Lanston in 1887 in competition with the Linotype. Monotype machines were
similar to Linotype machines except that individual characters could be typeset from the
keyboard. The final matrix had individual characters and so it was possible to manually
introduce kerning and typeset large display characters like dropcaps and raised caps.

Page layout has changed over time. In the early 1800s newspapers were typeset quite
closely, with very little space between lines and columns, and used very small font sizes.
The point size of the type used has not changed much since the early 1900s. After the
1900s older Gothic fonts gave way to newly designed fonts like Cheltenham, and headlines
started appearing in Bodoni font. Line-drawings and cartoons started appearing by
the 1870s. These were either carved on wooden blocks or etched on zinc plates. The
photoengraving process was invented in the 1860s in England and was perfected by
Frederic E. Ives of Cornell University in 1886. The photoengraving process was adapted
for the rotary presses used for printing newspapers in 1880. In 1897, The New York
Tribune was the first newspaper to start printing halftone reproductions of photographs.

In the 20th century, one of the most significant changes occurred in 1946 when the
U.S. Government Printing Office developed the Intertype Fotosetter. This was a
linecasting machine containing brass matrices with film negatives of characters. Images were
produced on photosensitive paper from which printing plates were created. In 1950 the
Photo 200 machine was invented. It had a spinning film matrix containing all characters
in a font and a stroboscopic light source was used to print on photo-sensitive paper.
For additional information about the history of printing, typography, and newspaper
publishing see [9, 30, 8].

1.2  System  Overview 

Our project will develop tools to process large quantities of digitized newspaper images
automatically. There are several reasons we feel that this effort is now feasible and
desirable. We have access to prototype OCR research software and we have considerable
expertise that will allow us to extend both OCR and IR research on these topics. In
addition, the confluence of greater storage and processing capacity with acceptance of
the web for distribution of content provides a suitable infrastructure for acceptance of
the system.

We have begun to develop our techniques and to build a small corpus by working
from a paper copy of an original paper copy of The Brooklyn Eagle for November 11,
1917 (Armistice Day) and from a spool of microfilm covering that issue of the newspaper.
Given the 75-year copyright duration, we wanted to find a newspaper from before 1923,
and Armistice Day had a variety of news stories including news about World War I as
well as descriptions of Suffragette marches. The current system includes several modules:
OCR, search, and interface.

2  Research 

2.1  Optical  Character  Recognition 


Figure 1: Original scanned image of the Brooklyn Daily Eagle.

The Optical Character Recognition system creates symbolic, searchable text from scanned
images [22, 4]. While there are numerous commercial OCR products, most of them fail
to recognize text in highly degraded newspaper images. The main causes for deterioration
of performance are (i) joined and broken characters, (ii) salt and pepper noise, (iii)
character-to-character variation due to old typesetting technology, (iv) page-bending at
the spine of the newspaper, (v) the very small gaps between columns and lines of text,
(vi) line separators and black outlines around advertisements, (vii) a wide variety of fonts,
and (viii) paper aging and degradation.

We have built a prototype OCR system using a commercial development kit from
Caere Corporation. The system currently does not provide us with segmented zones
but provides us with word bounding boxes, text strings within the boxes, and OCR
confidence levels. This output is currently being used by the IR system to locate the
search results on the page image. The recognition accuracy level on historic documents
is lower than what the system achieves on new newspapers. We are in the process of
creating a benchmarking dataset and conducting an OCR accuracy evaluation using the
methodology outlined in [21]. While the OCR results are not on a par with what one
gets on new newspaper documents, the IR results have shown that it is not necessary to
have very high accuracy for reasonable precision and recall (e.g., [6]).

Figure 2: The Brooklyn Daily Eagle after being processed with the line-removal algorithm.

Because our goal is to eventually retrieve articles, we need noise-removal algorithms,
robust zone segmentation algorithms, and higher-level post-processing. We will approach
the problem of noise filtering by first creating validated models of the microfilm degradation
process and the corresponding parameter-estimation algorithms [20, 19, 10]. These
models will then be used to design noise-removal and image-restoration algorithms. The
impact of the noise-removal algorithm will be evaluated using a benchmarking dataset.
Zone segmentation algorithms can be very sensitive to inter-column and inter-line gaps.

In  the  next  section,  we  describe  a  morphological  line-removal  algorithm  that  filters 
out  lines  from  an  image  before  it  is  processed  by  a  page  segmentation  algorithm.  This  is 
followed  by  an  adaptive  X-Y  cut  algorithm  [25]  to  create  blocks  of  homogeneous  regions, 
which  are  then  classified  into  text  and  non-text  regions  using  a  statistical  decision  tree. 
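
To make the X-Y cut step concrete, the following is a minimal sketch of a plain recursive
X-Y cut on a binary page image (1 = ink). It is not the adaptive algorithm of [25]: the
fixed `min_gap` threshold, the function names, and the toy `page` are all assumptions made
for illustration. The region is split at its widest empty row or column gap until no gap
of at least `min_gap` pixels remains.

```python
def projection(img, top, bottom, left, right, axis):
    """Ink count per row (axis=0) or per column (axis=1) inside the region."""
    if axis == 0:
        return [sum(img[r][left:right]) for r in range(top, bottom)]
    return [sum(img[r][c] for r in range(top, bottom)) for c in range(left, right)]


def widest_gap(profile):
    """Return (start, length) of the longest run of zeros between inked entries."""
    best, start = None, None
    for i, value in enumerate(profile):
        if value == 0:
            if start is None:
                start = i
        elif start is not None:
            if best is None or i - start > best[1]:
                best = (start, i - start)
            start = None
    return best


def xy_cut(img, top, bottom, left, right, min_gap=2):
    """Recursively split a region into blocks; returns (top, bottom, left, right) tuples."""
    rows = projection(img, top, bottom, left, right, axis=0)
    cols = projection(img, top, bottom, left, right, axis=1)
    if not any(rows):
        return []  # nothing but background
    # Tighten the region to the bounding box of its ink.
    r_idx = [i for i, v in enumerate(rows) if v]
    c_idx = [i for i, v in enumerate(cols) if v]
    top, bottom = top + r_idx[0], top + r_idx[-1] + 1
    left, right = left + c_idx[0], left + c_idx[-1] + 1
    rows = rows[r_idx[0]:r_idx[-1] + 1]
    cols = cols[c_idx[0]:c_idx[-1] + 1]
    # Candidate cuts: (gap length, axis, gap start); take the widest one.
    candidates = []
    for axis, profile in ((0, rows), (1, cols)):
        gap = widest_gap(profile)
        if gap and gap[1] >= min_gap:
            candidates.append((gap[1], axis, gap[0]))
    if not candidates:
        return [(top, bottom, left, right)]
    length, axis, start = max(candidates)
    if axis == 0:
        cut = top + start
        return (xy_cut(img, top, cut, left, right, min_gap)
                + xy_cut(img, cut + length, bottom, left, right, min_gap))
    cut = left + start
    return (xy_cut(img, top, bottom, left, cut, min_gap)
            + xy_cut(img, top, bottom, cut + length, right, min_gap))


# Toy page: two columns separated by a three-pixel gutter.
page = [
    [1, 1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 1],
]
blocks = xy_cut(page, 0, len(page), 0, len(page[0]), min_gap=2)
```

A fixed threshold such as `min_gap` is exactly what makes this kind of segmentation
sensitive to narrow inter-column gaps, which is one reason an adaptive variant is
preferable for newspaper pages.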

2.1.1  Line  Removal 

Mathematical morphology is an image processing approach based on set theory [26, 33,
12]. Images are considered as pointsets and dilation, erosion, closing, and opening are
performed on the images to extract features and find spatial relations among the detected
features [17]. The standard set theory operations are also valid operations that can be
performed on two images.

We first deskew the image to make the lines parallel to the vertical and horizontal
axes. Next we fill in breaks in vertical and horizontal lines by performing a closing
operation with vertical and horizontal structuring elements. Next, we remove speckle
noise by opening the resulting image with a disk structuring element. The vertical and
horizontal lines are detected by performing an opening operation with long vertical and
horizontal structuring elements. The detected lines are then subtracted from the original
image. For details of the algorithm the reader is referred to [15, 17, 12].

Figure 3: Automatic zone segmentation for the Brooklyn Eagle without first removing
lines. Notice that many columns are merged.

In Figure 1 we show the original image of a historical newspaper. Most of the lines
are vertical or horizontal. In Figure 2 we show the result of our line removal algorithm.
Notice that not all lines are removed. This is because there are still a few breaks in the
lines that were not filled in by the preprocessing step.
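
The closing, opening, and subtraction steps can be sketched on a small binary image
(1 = ink) as follows. This is an illustrative sketch only, not the implementation
described in [15, 17, 12]: deskewing and the disk opening for speckle removal are
omitted, the structuring elements are one-dimensional line segments, and the names
`close_len` and `min_len` with their values are assumptions chosen for the toy example.

```python
def erode(row, k):
    """Erosion by a line of length k: keep i only if row[i:i+k] is all ink."""
    n = len(row)
    return [int(i + k <= n and all(row[i:i + k])) for i in range(n)]


def dilate(row, k):
    """Dilation by the same line: mark i if any of row[i-k+1..i] is ink."""
    return [int(any(row[max(0, i - k + 1):i + 1])) for i in range(len(row))]


def open_1d(row, k):
    """Opening (erode, then dilate): keeps only runs of at least k ink pixels."""
    return dilate(erode(row, k), k)


def close_1d(row, k):
    """Closing (dilate, then erode): fills breaks shorter than k pixels."""
    pad = [0] * k  # background padding so pixels near the border are preserved
    return erode(dilate(pad + row + pad, k), k)[k:-k]


def transpose(img):
    return [list(col) for col in zip(*img)]


def apply_rows(img, op, k):
    return [op(list(row), k) for row in img]


def apply_cols(img, op, k):
    return transpose(apply_rows(transpose(img), op, k))


def remove_lines(img, close_len=2, min_len=5):
    """Detect long horizontal and vertical rules and subtract them from img."""
    # Fill short breaks along each direction.
    closed_h = apply_rows(img, close_1d, close_len)
    closed_v = apply_cols(img, close_1d, close_len)
    # Whatever survives an opening with a long line element is treated as a rule.
    lines_h = apply_rows(closed_h, open_1d, min_len)
    lines_v = apply_cols(closed_v, open_1d, min_len)
    # Subtract the detected rules from the original image.
    return [[int(p and not (h or v)) for p, h, v in zip(p_row, h_row, v_row)]
            for p_row, h_row, v_row in zip(img, lines_h, lines_v)]


# Toy page: a horizontal rule with a one-pixel break above two rows of "text".
page = [
    [1, 1, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
cleaned = remove_lines(page)
```

In this toy case the broken rule in the top row is bridged by the closing and then
removed, while the short text strokes survive. A break wider than `close_len - 1`
pixels would not be bridged, which mirrors the residual lines visible in Figure 2.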

2.1.2  Zone  Segmentation 

Historical newspapers have a fairly regular layout. However, the column and row gaps
are typically quite narrow. Furthermore, black bounding boxes around advertisements
and line separators between columns are the main causes of incorrect page segmentation.
We have built an algorithm that filters the lines and bounding boxes before doing OCR.
Preliminary results show that the filtering algorithm leads to proper segmentation and
results in detection of words that were earlier not recognized. The segmented zones will
then be classified into various types (text block, heading, title, advertisement, logo, etc.)
using a statistical decision tree [14, 31, 29, 36]. The decision tree will automatically
construct the rules for minimum-error classification. The construction of the decision
tree requires a dataset of images with corresponding manually segmented and labeled
zones. A benchmarking dataset will be created for this purpose.

Finally, to evaluate the performance of the OCR system, we will create a benchmarking
image dataset with corresponding zone and character groundtruth [11, 16, 18, 21].
A statistical, stratified sampling procedure will be adopted to create a representative
sample of images. The sample will be representative of the variation in font, typesetting
technology, layout, degradation, microfilm, etc. This dataset will be split into two parts.
One part will be used for designing noise-removal algorithms and building decision trees,
and the second part will be used for independent evaluation.
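
As a rough sketch of what such a procedure could look like, the code below draws the same
number of pages from every stratum and then divides each draw between a design set and
an evaluation set. The page records, the stratum keys (`typesetting`, `condition`), the
sample size, and the even split are hypothetical; the actual sampling design is not
specified here.

```python
import random
from collections import defaultdict


def stratified_split(pages, stratum_of, per_stratum, seed=0):
    """Sample per_stratum pages from each stratum; split each sample in half."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    strata = defaultdict(list)
    for page in pages:
        strata[stratum_of(page)].append(page)
    design, evaluation = [], []
    for key in sorted(strata):
        members = strata[key]
        # rng.sample returns the draw in random order, so halving it is unbiased.
        sample = rng.sample(members, min(per_stratum, len(members)))
        half = len(sample) // 2
        design.extend(sample[:half])
        evaluation.extend(sample[half:])
    return design, evaluation


# Hypothetical page catalogue: 10 pages for each of 4 strata.
pages = []
for typesetting in ("hand-set", "linotype"):
    for condition in ("clean", "degraded"):
        for _ in range(10):
            pages.append({"id": len(pages),
                          "typesetting": typesetting,
                          "condition": condition})

design, evaluation = stratified_split(
    pages, lambda p: (p["typesetting"], p["condition"]), per_stratum=4)
```

Splitting within each stratum, rather than splitting the pooled sample, keeps both
parts representative of every combination of typesetting technology and condition.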
Figure 4: Automatic zone segmentation for the Brooklyn Eagle after removing lines.

2.2  Statistical  Language  Processing  and  Information  Retrieval 

There have been extensive studies of the impact of degraded OCR on retrieval performance
[7]. Typically, retrieval is fairly robust with word-error rates of up to about 30 or
40%. For newspapers from the late 19th and early 20th century, our results should be
safely within that range.
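
For reference, a word-error rate can be computed as the word-level edit distance between
the OCR output and the groundtruth, divided by the number of groundtruth words. The
sketch below uses this common definition, which is an assumption: the exact measure used
in the studies cited in [7] is not stated here. The sample strings are made up to mimic
typical OCR confusions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                             # deletion
                           cur[j - 1] + 1,                          # insertion
                           prev[j - 1] + (ref_word != hyp_word)))   # substitution
        prev = cur
    return prev[-1] / len(ref)


truth = "the united states government will listen to no armistice proposal"
ocr = "the nnited states govemment will listen to no armistice proposal"
wer = word_error_rate(truth, ocr)  # 2 substitutions over 10 reference words
```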

The OCR issues merge into and interact with higher-level linguistic concerns. For
instance, we need to determine how stories are continued across columns, around pictures,
and onto inside pages. There are several approaches to this story-continuation problem.
First, the similarity of blocks of text may be compared. Second, specific features of the
text may be analyzed. For instance, a sentence fragment at the end of one section of the
story must be completed in the continuation. Story continuation is itself related to article
extraction. That is, how successful are we at getting intact articles and at identifying all
of the articles on a page?
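
The first approach, comparing the similarity of text blocks, could be sketched with a
simple bag-of-words cosine similarity, as below. The choice of cosine similarity over raw
term counts, the function names, and the example text are assumptions for illustration;
a practical system would also need term weighting and tolerance to OCR errors.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lower-cased word counts; punctuation and digits are ignored."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def best_continuation(fragment, candidates):
    """Return (index of the most similar candidate block, all scores)."""
    fv = term_vector(fragment)
    scores = [cosine(fv, term_vector(c)) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__), scores


# Made-up text blocks: the end of a front-page story and two inside-page zones.
fragment = ("The armistice proposal was announced by the radicals "
            "leading the revolution in Russia")
candidates = [
    "Suffragette marchers paraded along the avenue on Saturday afternoon",
    "The government will listen to no armistice proposal from the radicals in Russia",
]
best, scores = best_continuation(fragment, candidates)
```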

Once articles have been identified, we need to categorize news stories. As a benchmark,
we will compare our classification of news stories with The New York Times
Index and we will conduct studies on the utility of the categorizations to facilitate
end-user access. Furthermore, users might want to navigate through a collection by following
threads of news topics. We will extend our categorization techniques to determining
those threads. In a modern context, similar questions have begun to be addressed by the
Topic Detection and Tracking project [1].

Linguistic information does more than enhance the OCR: it is the core of the retrieval
process, and it can also be used for linguistic research. For instance, consider the following
sentence from the Brooklyn Eagle for November 11, 1917:

The United States Government
will listen to no armistice
proposal such as is announced
to be the program of Lenine and
the radicals who are leading the
new Revolution in Russia.

Figure 5: Prototype of the user interface. Hits of the search string “british” are indicated
in the map at the top left and where they appear in the page image.

In addition, journalistic styles changed in the time periods covered by these corpora.
Specifically, early news stories tended to follow a chronological order while more modern
stories follow the “pyramid” structure of providing several layers of detail. We will
develop automatic methods for detecting those differing styles.

We anticipate that there will be substantial interest in tracking names of people and
places which appear in the newspapers. This could, for instance, be used for genealogical
research. We will examine the impact of OCR errors on named-entity identification.


2.3  Metadata  and  Mark-Up 

Several layers of metadata are required for this project. Some of these are already
established. For instance, there are extensive guidelines for cataloging newspapers [32].
However, there is a need for standard descriptions of both the logical structure and
physical layout of newspaper content. These goals are consistent with the mark-up standards
of the Text Encoding Initiative (TEI) [13]. There is also a need to provide metadata
which will enhance the performance of the OCR techniques. For instance, characteristics
of the newspaper style, such as the number of columns across the page, should be useful
for the recognizer.
2.4  User  Interface


A newspaper is a very complex and highly detailed object. Sophisticated interfaces will
be needed to allow the users to navigate at several levels of granularity: within a single
page and a single newspaper, but also across issues of the newspaper, and eventually
across collections of newspapers.

Moreover, we need to support two very different types of users. A corpus-developer
interface will be built for users who need to inspect and update the OCR and zoning.
It is likely that early versions of the software will have significant numbers of errors and
that hand tuning will be required to obtain a useful corpus.

The end-user interface will allow researchers and students to access the collections.
This interface will allow users to search terms and phrases derived from the OCR and
then to view the search hits superimposed on the page image. As shown in Figure 5,
we have been experimenting with the use of thumbnail maps to provide navigation and
landmarks for the full-page image.
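
The link between OCR output and the display can be illustrated with a small sketch: each
recognized word carries a bounding box in page-image coordinates, a query selects the
matching words, and their boxes are scaled down to mark hits on the thumbnail map. The
`OcrWord` record, its field layout, and the sample coordinates are hypothetical and do
not reflect the output format of the commercial OCR kit.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str
    box: tuple          # (left, top, right, bottom) in page-image pixels
    confidence: float   # recognizer confidence, assumed to lie in [0, 1]


def find_hits(words, query):
    """Case-insensitive exact match, ignoring punctuation attached to a word."""
    q = query.lower()
    return [w for w in words if w.text.lower().strip(".,;:!?\"'") == q]


def to_thumbnail(box, page_size, thumb_size):
    """Scale a page-image box to thumbnail coordinates."""
    left, top, right, bottom = box
    page_w, page_h = page_size
    thumb_w, thumb_h = thumb_size
    return (round(left * thumb_w / page_w), round(top * thumb_h / page_h),
            round(right * thumb_w / page_w), round(bottom * thumb_h / page_h))


# Made-up OCR output for a 3000 x 4000 pixel page image.
words = [
    OcrWord("British", (120, 400, 260, 440), 0.91),
    OcrWord("troops", (270, 400, 380, 440), 0.88),
    OcrWord("british,", (1500, 2200, 1640, 2240), 0.62),
]
hits = find_hits(words, "british")
thumb_boxes = [to_thumbnail(w.box, (3000, 4000), (150, 200)) for w in hits]
```

The same boxes can be drawn at full resolution on the page image, so the thumbnail and
the page view stay consistent.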

As suggested in the discussion of IR techniques, there may be many ways to navigate
through large corpora, for instance by tracking names of people or by following threads
of news stories. We will develop graphical timeline interfaces [23] to help the user keep
oriented.

3  Discussion 

3.1  Richer  Corpora 

To evaluate the performance of the OCR system and the noise-removal algorithms we
require a representative sample of the newspapers. We will create testing and training
datasets of scanned images that will represent the various types of fonts, typesetting
technologies, layout, degradation, and microfilm. Initially, we will create representative
samples of the images for the collections of newspapers we will work with. Then we
will create datasets that are representative of a larger population of newspapers. One
microfilm roll contains approximately two weeks of a newspaper with about 40 pages in
each issue. Microfilm scanners can scan one roll at 600 dpi in approximately one hour.
The creation of the scanned image collection will take approximately three months at
eight hours per day.

We will digitize the Negro Newspaper Collection [24], which is available from the
Library of Congress, and consists of 180 microfilm rolls. For example, our proposed
indexable newspaper collection of Reconstruction Era (1863-1877) African-American
newspapers would allow historians to quickly check hypotheses and search for references to
specific individuals.

We will also digitize a collection of Pennsylvania Dutch German-language immigrant
newspapers for cross-language research. These include the Reading Adler, Der Deustsche
Porcupinen und Lancaster Anzeigs-Nachrichten Deutsche Porcupinen, and Neue
Unpartyische Readingen Zeitung und Anzeigs-Nachrichten. Beyond these niche collections, we
will also index a large-city newspaper from page images because of the wide range of
issues it covers. We have chosen The New York Times because of its mix of national
and international stories as well as its coverage (which is, perhaps, spotty in some cases)
of the diverse ethnic groups in New York City. We will scan 15 years of The New York
Times (approximately 400 microfilm rolls) to study issues such as the history of women
and the impact of advertisements on society.

3.2  History  and  Journalism  Research  and  Education 

We are working with historians whose research covers the topics to be examined in this
project, such as the African-American experience during the Civil War and Reconstruction
[3], women’s rights in the late 19th and early 20th century [27, 28], and immigrant issues.
Because we expect the historians to be consistent and dedicated users of these resources,
we will focus on providing support for extended interaction. For instance, we will develop
techniques for them to add annotations. The researchers will be monitored in the use of
the tools. Furthermore, observation sessions will be established in which the users will
be asked to “think aloud” as they interact with the tools.

Several of these historians and journalists (e.g., [2]) have particular interest in
undergraduate education. While professional historians will often painstakingly examine large
quantities of primary research material, students have neither the time nor the patience
for that type of research. Online search and access to primary historical sources should
greatly facilitate students’ use of that material.

The projects will teach students the skills involved in doing primary historical
research, and the courses will include a survey of United States history, a survey of U.S.
women’s history, and specialized courses on gender and on progressive reform in the
early 20th century. The graduate assistants will not only help determine the feasibility
of various projects but will also teach students the technical aspects of searching digitized
newspapers.

These educational tools will be formally evaluated by dividing the students into two
sections. In one section, the students will do research using traditional methods; in the
second section, they will be allowed to use the research prototype. The reports prepared
by these two sets of students will be graded by independent readers and the quality
compared across the two groups. The results will be disseminated to other historians
through the Organization of American Historians.

Inevitably, there will be errors in the OCR and linguistic processing. While the
presence of these errors is not ideal for the historians, the benefits of ease of indexing
outweigh the limitations of the errors. Naturally, it is a goal of our OCR research to
minimize errors, but we will also provide mechanisms for quality control and corpus
revisions. Furthermore, we will freely report the frequency of various types of errors so
that users can be aware of them.

3.3  Further  Research  Issues 

Additional scientific questions we plan to address include:

•  Statistical stratified sampling methods [5] will be used to create a representative
image dataset with groundtruth, which will be used as a benchmark for performance
evaluation of OCR systems.

•  Because much of immigrant history is archived in non-English-language newspapers,
we will adapt the interface and search tools for cross-language information
retrieval. We will work with a Pennsylvania Dutch German-language newspaper
collection which is also of interest to the project historians.

•  Query-by-image-content:  Many  image  regions  may  not  be  textual,  such  as  logos 
and  text  in  unusual  fonts.  We  will  allow  image-based  search  for  such  regions  using 
QBIC-like  search  tools  [10]. 

•  Beyond basic text processing, the system should also be able to identify and provide
access to the wide variety of material included in newspapers such as advertisements.
This might be useful, for instance, in studying the changing image of
women portrayed in those advertisements.